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The polar decomposition of an m x n matrix A of full rank, where m > n, can 
be computed using a quadraticaily convergent algorithm of Higham [SIAM J. Sci. Stat. 
Comput., 7 (1986), pp. 1160-1 174]. The algorithm is based on a Newton iteration involv- 
ing a matrix inverse. We show how with the use of a preliminary complete orthogonal 
decomposition the algorithm cm be extended to arbitrary A. We also describe how to 
use the algorithm to compute the positive semi-definite square root of a Hermitian pos- 
itive semi-definite matrix. We formulate a hybrid algorithm which adaptively switches 
from the matrix inversion based iteration to a matrix multiplication based iteration due 
to Kovarik, and to Bjorck and Bowie. The decision when to switch is made using a 
condition estimator. This “matrix multiplication rich” algorithm is shown to be more 
efficient on machines for which matrix multiplication can be executed 1.5 times faster 
than matrix inversion. 
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1. Introduction 

A polar decomposition of a matrix A 6 C mXn is a factorization A = UH, where 
H £ C nxn is Hermitian positive semi-definite and U £ C mXn is unitary; here we 
define unitary to mean that U has orthonormai rows or columns according as m < n 
or m > n. The decomposition always exists, H is the unique Hermitian positive semi- 
definite square root of A* A (i.e. H = (A* A) 1 / 2 ), and U is unique if and only if A has 
full rank (these properties are proved in section 2). 

The polar decomposition is well-known in the case m > n; see [7, 10] for example. 
We have followed Horn and Johnson [13] in extending the definition to m < n. The 
consistency of the definition can be seen in the result that for any m and n the uni- 
tary polar factor U is a nearest unitary matrix to A in the Frobenius norm (this is a 
straightforward extension of a result from [6]). Because of the role it plays in solving 
this and other nearness problems, computation of the polar decomposition is required 
in several applications [12]. A recent application, which motivated the work here, is the 
computation of block reflectors (generalizations of Householder matrices) [17]. Here, 
the polar decomposition of an arbitrary matrix must be computed, and it is desirable 
to do this efficiently on vector and parallel computers. 

The polar decomposition can be obtained directly from the singular value decom- 
position (SVD). Higham [10] describes an alternative approach based on a Newton 
iteration involving a matrix inverse. The iteration is defined for square, nonsingular 
matrices only, but in [10] it is pointed out how a preliminary QR decomposition enables 
the treatment of A € C mXn with m>n and rank(A) = n. It is also shown in [10] how 
the iteration can be used to compute the square root of a Hermitian positive definite 
matrix. According to the traditional model of computational cost based on operation 
counts, the iterative algorithm is generally of similar expense to the SVD approach, but 
is much more efficient when the matrix is nearly unitary. In an attempt to improve the 
performance of the iterative algorithm on machines which execute matrix multiplication 
at high efficiency, Schreiber and Parlett [17] propose the use of an inner Schulz iteration 
to compute most of the matrix inverses; they show that this leads to an increase in 
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efficiency if matrix multiplication can be done at a rate 6.8 times faster than matrix 
inversion. 

The purpose of this work is two-fold. First, we extend the algorithm of [10] so that 
it is applicable to arbitrary A. Our technique is to use an initial complete orthogonal 
decomposition so as to extract an appropriate square, nonsingular matrix. One might 
say that the complete orthogonal decomposition is to the polar decomposition what 
Chan’s preliminary QR factorization is to the SVD! We also show how to use the 
algorithm of [10] to compute the square root of a (singular) Hermitian positive semi- 
definite matrix. Second, we introduce a modification of the Schulz inner iteration idea 
of [17] which reduces the cutoff ratio of multiplication speed to inversion speed from 6.8 
to 2, or to 1.5 if advantage is taken of a symmetric matrix product. 
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2. Iterative Polar Decomposition of an Arbitrary Matrix 

The basic algorithm of [10] is as follows. It converges quadratically for any square, 
nonsingular A. We use a MATLAB-like algorithmic notation, and denote by A ~ * the 
conjugate transpose of A -1 . 


Algorithm 2.1: [U,H] = polar. square (A, S) 

% Input arguments: square, nonsingular A; convergence tolerance S. 
% Output arguments: U , H. 

X 0 = A; k = -1 
repeat 

k = k + 1 

ik = (l|r 4 - 1 |lil^*- 1 lloo/(l|r»||,||jr t |U)) ,/4 

Xk+i = 1 (rkXk + 
untU||X t+1 -X ( || 1 <«p t+1 || 1 

v = x t+1 

H = \{XJ*A + A*JJ) | 


To adapt the algorithm to arbitrary A € C mXn we begin by computing a complete 
orthogonal decomposition (COD) 


A = P 


R 0 
0 °. 


Q\ 


where P 6 C mXm and Q € C nXn are unitary, and R € C rXr is nonsingular and upper 
triangular (we exclude the trivial case A = 0, for which R is empty). This decomposition 
may be computed using a QR factorization with column pivoting followed by a further 
Householder reduction step; see [8, p.169] for the details. Now we apply Algorithm 2.1 
to R, obtaining R = UrHr, and we “piece together” the polar factors of A. We have 


A = P 


= P 
= P 


'UrHr o' 

0 0 

'Ur 0 

0 Im—r,n 

§u R 0 

0 Im—r,n 


Q* 


—r 


— r 


][’■ 

\q’-Q 


Hr 

0 


0 

0 


= UH, 


Q* 
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where / m _ r ,n-r denotes the (m— r)x(n— r) identity matrix. Note that I m -r,n-r could be 
replaced by any unitary matrix of the same dimensions; this shows the non-uniqueness 
of the unitary polar factor when r = rank(A) < min(m,n). Note also that even though 
U*U ^ I when m < n, H — U*UH for all m and n; thus A* A = HU'UH — H 2 , so 
that H = (A*A) 1 ! 2 . 

In evaluating U and H advantage can be taken of the zero blocks in the products. 
Denoting by Q\ the first r columns of Q we have 

(2.1) H = Q 1 HrQ* 1 . 

For U we partition 

r m— r 

p= (Pi, p * ) 

and we distinguish the two cases 

\U R 0 1 r T 

(2.2 a)m>n =* U = [P u P 2 ] [7 n _ r j Q* = [P X U R , P 2 ln ~ r ]Q\ 

[ 0 J J L u J 

(2.2 b)m<n =* U = [P 1 ,P 2 ] U « Q] Q* = [PiU R , P 2 [/ m _ r , 0 }}Q*, 

in which, respectively, the last m — n columns of P 2 and the last n — m rows of Q* need 
not participate in the multiplication. 

To summarise, we have the following algorithm: 

Algorithm 2.2: [U, H] = polar (A, e, 5) 

% A ^ 0 is arbitrary. 

[P,if,Q] = C0D(4,€) 

[Ur, Hr] = polar.square(P, S) 

Form U, H according to (2.1) and (2.2). I 

As the notation indicates, in floating point arithmetic a tolerance e is required for 
the complete orthogonal decomposition in order to determine a numerical rank (i.e. the 
dimension of R ). The natural approach is to set to zero all rows of the trapezoidal 
QR factor of A which are negligible (in some measure) relative to e (see [8, p.166]). 
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The choice of e is important, since a small change in e can produce a large change 
in the computed U when A is rank-deficient. However, a redeeming feature is that 
whatever the choice of e, and irrespective of how well the QR factorization reveals rank, 
Algorithm 2.2 is stable, that is, the computed polar factors U , H satisfy 

UH = A + E, 

where ||.E||f ~ max{e, 5}||A||f; this follows from the empirical stability of Algorithm 2.1 
(see [10]) together with the stability of the additional orthogonal transformations in 
Algorithm 2.2. 

The operation count of Algorithm 2.2 breaks down as follows, using the “flop” 
notation [8, p.32]. The complete orthogonal decomposition requires 2 mnr — r 2 (m + n) + 
2r 3 /3 + r 2 (n — r) flops [8, pp.165,170]. Algorithm 2.1 requires, typically, 8 iterations 
(assuming 8 « 10~ 16 ), and hence (7+ |)r 3 flops (taking into account the triangularity of 
R). And formation of H and U requires at most nr 2 + n 2 r/2 and mr 2 + max{ nm 2 , mn 2 } 
flops respectively. By comparison, computing a polar decomposition via the Golub- 
Reinsch SVD algorithm requires approximately 8mn 2 +25 n 3 /6 flops when m > n. The 
Golub-Reinsch SVD algorithm does not take advantage of rank- deficiency, although it 
could be modified to do so by using an initial complete orthogonal decomposition as 
above. 

Of course, operation counts are not always a reliable guide to the actual compu- 
tational cost on modern vector and parallel computers. An alternative performance 
indicator is the amount of matrix multiplication in an algorithm, since matrix multipli- 
cation can be performed very efficiently on many modern computers [1, 2, 16, 18]. As 
we will see in the next section, Algorithm 2.1 can be modified so that it is rich in matrix 
multiplication. In the complete orthogonal decomposition in Algorithm 2.2 the second 
Householder reduction step can be accomplished using the matrix multiplication rich 
WY representation of [2, 18]. In the initial QR factorization effective use of the WY 
representation is precluded by the column pivoting. One alternative is to use Bischof’s 
local pivoting and incremental condition estimation technique [3], which does not hin- 
der exploitation of the WY form. Another alternative is to compute a QR factorization 
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without pivoting, and then to apply Chan’s post-processing algorithm [5] for obtaining 
a rank-revealing QR factorization. 

Finally we show how to use Algorithm 2.1 to compute the Hermitian positive semi- 
definite square root of a Hermitian positive semi-definite A € C nXn . First, we compute 
a Cholesky decomposition with pivoting, 

H T AH = R*R, R=\ Rn Rl2 1 , 

0 0 J 

where R\\ € C rXr is nonsingular and upper triangular. Then Householder transforma- 
tions are used to zero R\ i (as in the complete orthogonal decomposition): 

U*U T AILU = ^ [Tn 0], Tii € C rxr upper triangular. 

Next, Algorithm 2.1 applied to Tn yields T n = UtHt, whence, with Q = H£7, 

Square roots of semi-definite matrices are required in some statistical applications [9], 
An alternative to this polar decomposition approach is to make use of an eigendecom- 
position; the relative merits are similar to those discussed above for the SVD. 
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3. A Hybrid Iteration 

To make Algorithm 2.1 rich in matrix multiplication rather than matrix inversion 
Schreiber and Parlett [17] use an inner Schulz iteration 


(3.1) Zj+i — Zj + (I — ZjX k )Zj, Zo = 


to compute X' t T 1 on all iterations after the first. This approach takes advantage of 
the fact that sirice the X k are converging quadratically, X^\ is an increasingly good 
approximation to X^ 1 . The Schulz iteration (3.1) is a Newton iteration and so also 
converges quadratically. Schreiber and Parlett observe that for the matrices in their 
application (which are often well-conditioned) the typical number of inner iterations 
required for convergence is 6, 5, 3, 2, 1, leading to 17 iterations in total, or 34 matrix 
multiplications. If the matrix inverses were computed directly, five inverses would be 
needed. This suggests that the modified algorithm will be faster than Algorithm 2.1 if 
matrix multiplication can be done at a rate 34/5 times faster than matrix inversion. 

Further experimentation with the inner Schulz iteration led us to feel that it is 
unnecessary to run the inner iteration to convergence, and we considered employing 
just one Schulz iteration, with the starting matrix Zq = X% (« X^ x since X k converges 
to a unitary matrix). Thus the basic iteration 


(3.2) 


x k+l = + ±xr) 


is replaced by (setting j k = 1) 


(3.3) 


x k+1 = i (x t + (z; + z;(i -z»x t y)) 
= x k (i+\(i-xix k j). 


This is precisely the quadratically convergent iteration of Kovarik [14] and Bjorck and 
Bowie [4] for computing the unitary polar factor! Hence, just a single inner Schulz 
iteration is enough to maintain quadratic convergence. 

For (3.3) it holds in any norm for which ||/|| = 1 that 


l|v : +1 v * +1 - 1|| < \\x;x t - if if \\x;x 0 - f|| < i. 
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(As this suggests, the asymptotic error constant is 1 for (3.3) compared with 1/2 for 
(3.2) [10]). To maximise the number of matrix multiplications we therefore need to 
switch from iteration (3.2) to iteration (3.3) as soon as the convergence condition 

( 3 . 4 ) \\x* k x k -i\\<o<i, 

is satisfied; to ensure fast convergence 6 should not be too close to 1. As explained 
below, typically '(3.4) is satisfied for k = 3 with 6 = 0.6 (and obviously for k — 0 if 
Xo = A happens to be nearly unitary). Rather than expend a matrix multiplication 
testing (3.4) we can use the matrix norm estimator CONEST from [11]. This computes 
a lower bound for ||C||i by sampling several matrix-vector products Cx and C*x; thus 
we can estimate — J||i, without forming X%Xk, in 0(r 2 ) flops (for r x r Xk). 

A suitable way to use the estimate is to test whether it is less than A 9, where A < 1. 
If so, X£Xk — I is formed, in preparation for (3.3), and its norm is taken. If (3.4) is 
satisfied then (3.3) is used — otherwise we revert to iteration (3.2). The optimum choice 
of A depends on the desired bias between wasting a matrix multiplication in an abortive 
switch of iteration, and not switching soon enough. The estimate from CONEST is 
almost always correct to within a factor 3, so A > 1/3 is appropriate. In practice we 
have found that the performance of the algorithm is fairly insensitive to the choices of 
9 and A. 

To summarise, our hybrid inversion/multiplication algorithm is as follows. 
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Algorithm 3.1: [U, H ] = polar.mult (A, S, A, 0) 

% A must be a square, nonsingular matrix. 

Xo = A; k — —1; \i = 1 ; switched = false 

repeat 

k = k + 1 

if switched 

I l = I-XiX k] /i = ||A||, 

evaluate (3.3) 

else 

H = CONEST(J - X* k X k ) 
if n > XO 

evaluate (3.2) 

else 

R = I-X* k X k] n = ||fZ||i 

if n > 9, evaluate (3.2), else evaluate (3.3), switched = true; end 

end 

end 

until fj, < 8 

U = X fc+1 

H = \{U*A + A*U) | 

Since iteration (3.3) requires two matrix multiplications, and iteration (3.2) re- 
quires one inversion, Algorithm 3.1 will be more efficient than Algorithm 2.1 if matrix 
multiplication can be done at twice the rate of matrix inversion; thus, compared with 
using the full inner Schulz iteration, the “cutoff ratio” is 2 instead of 6.8. Moreover, 
if advantage is taken of the symmetry of the second matrix product in (3.3) the cut- 
off ratio is reduced to 1.5. The overall speedup depends on the ratio of inversions to 
multiplications, which in turn depends on the conditioning of the matrix, as discussed 
below. 

All the algorithms mentioned here have been coded and tested in PC-MATLAB [15], 
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running on an IBM PC-AT. For this machine the unit roundoff u R 2 2.22 x 10 -16 . We 
used 9 = .6, A = .75, 8 = y/ru, where r is the dimension of the matrix A in Algorithms 
2.1 and 3.1, and e = max(m, n)|<nl« in the complete orthogonal decomposition, where 
T is the triangular factor from the QR factorization with complete pivoting. 

The following comments summarise our numerical experience, based on a wide 
variety of test matrices. 

• Algorithms 2.1 and 3.1 usually require the same number of iterations. Occa- 
sionally Algorithm 3.1 requires one more iteration due to the larger error constant for 
iteration (3.3). 

• In general, the typical number of iterations for Algorithm 3.1 is 7-9, with the 
switch of iteration on iteration 3 or 4. 

• For well-conditioned matrices (k 2 (A) < 10, say), as are common in certain appli- 
cations (see [12]), Algorithm 3.1 tends to require at most 7 iterations and to switch on 
iteration 1-3. 

We present the results for one representative matrix in detail: A is MATLAB’s 
“gallery(5)”, a 5 x 5 nilpotent matrix. Using Algorithm 3.1 within Algorithm 2.2, the 
numerical rank is diagnosed as 4, and Algorithm 3.1 is presented with a triangular matrix 
having one singular value of order 10 5 and three of order 1. Table 3.1 summarises the 
iteration. The backward error ||A — UH\\i = 4.7u||A||i. 
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Table 3.1 


k 

Kp{Xk+ 1) 

\\x- k x k - j||,t 

Iteration 

Ik 

0 

2.6265E7 

1.1380E10 

(3.2) 

3.1546E-3 

1 

5.3197E2 

2.6233E4 

(3.2) 

8.0931E-3 

2 

4.0886E0 

8.0962E-2* 

(3.3) 


3 

3.9959E0 • 

4.4915E-3 

(3.3) 


4 

4.0000E0 

1.3686E-5 

(3.3) 


5 

4.0000E0 

1.2607E-10 

(3.3) 


6 

4.0000E0 

1.5765E-17 

(3.3) 



f Norm estimate while (3.2) used; exact quantity while (3.3) used. 

* Norm estimate exact to 5 digits. 
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